SYSTEM AND METHOD FOR SEARCHING
INFORMATION STORED ON A NETWORK 

BACKGROUND OF THE INVENTION 

The field of the invention is searching, and in particular 
searching for information stored in a set of websites. 

A website ("site") is defined herein as a collection of files 
stored on a computer (e.g., a server) that is connected to a 
network. The World Wide Web (WWW) is a collection of 
websites whose servers are interconnected through the Internet.
A collection of websites can also be stored on servers
that are interconnected through a private network, e.g., 
through an intranet. 

In many cases, at least some of the files of a website 
contain hyperlinks. A hyperlink is typically a text, graphic or
image object in a first file that, when selected by a user, 
either causes a second file to be displayed to the user, causes 
a different part of the first file to be displayed to the user, or 
executes a program. In this way, a file in a website can be 
interrelated with another file stored at the same website, a 
different website, or elsewhere. The interrelated files of a 
single website usually reflect a common theme, such as 
information about a particular company, activity, or service. 

The amount of information stored in a collection of 
websites can be substantial. For example, the WWW 
includes over 600,000 websites. Conservatively assuming 
an average data size of 2 Megabytes (MB) per website, the 
WWW includes over 1200 billion bytes of information 
across a wide range of topics. Finding a particular piece of 
information in such a large collection can be problematic. 
For example, simple browsing through the websites in 
search of a particular type of information can be impractical 
in a website collection of substantial size. 

One known system addresses the problem of finding 
particular information stored at websites by categorizing 
websites according to the topic or topics to which they 
pertain. One such known system is the Yahoo! search engine 
located at <http://www.yahoo.com>. Yahoo! obtains information
about the topic or theme to which a website pertains
along with a brief narrative describing the contents of the 
website (i.e., from the administrator or owner of the 
website). This information (along with a website identifier) 
is then correlated with a category. The Yahoo! categories are 
organized hierarchically, so that a given category typically 
has one or more subcategories, and each such subcategory 
has further subcategories, etc. 

An example of a Yahoo! interface is shown in FIG. 1. An 
example of a category is Arts & Humanities 101, which has
subcategories Literature 102 and Photography 103. When a 
user selects the Literature subcategory 102, Yahoo! displays 
the page shown in FIG. 2 to the user. FIG. 2 shows numerous
subcategories 201 of the Literature subcategory 102. 
Hereinafter, the term "category" will be used interchange- 
ably with the term "subcategory." 

Yahoo! also accommodates keyword searching. In FIG. 2, 
a user has entered a search for the keyword "telephone" 202 
that is restricted 203 to websites in the Literature category. 
In this case, the user may be interested in finding literature 
where the telephone plays a major role. When the search 
button 204 is selected, only website descriptions, and not 
website content, that fall under the category "Literature" are 
searched for the term "telephone." Website descriptions are 
generally terse, one line or one paragraph summaries 
describing the content of the website. A website description 
cannot fully capture all of the detail contained in the 
website's content. Indeed, by definition, it is a summary.
Because only the descriptions are keyword searched, and not
the content, a Yahoo! keyword search can disadvantageously
miss relevant content even when the keyword search is
limited to website descriptions in a relevant category. Websites
whose descriptions contain the term "telephone" are
displayed to the user, as shown in FIG. 3.

As discussed above, because Yahoo! keyword searches
only search the descriptions of websites and not their
content, a Yahoo! keyword search can miss identifying
websites that contain information relevant to the user's
request. Thus, for example, many files at different websites
in the Literature category may well contain the keyword
"telephone." None of these would be detected and displayed
to the user by Yahoo!, even though the user is interested in
finding occurrences of "telephone" in websites that fall
within the Literature category. In this way, the Yahoo!-type
category/descriptive information search is overly narrow,
and is prone to miss detecting information that the user
would be interested in seeing.

Another known system for searching for information at
websites stores and indexes a vast amount of content from
numerous websites, but does not correlate website content
with categories. One such known system is AltaVista™,
located at <http://www.altavista.digital.com>. In
AltaVista™, a user submits a keyword search. FIG. 4 shows
the AltaVista™ interface in which a user has submitted a
keyword search request for the term "AT&T" 401. In
response, AltaVista™ searches its stored content for occurrences
of the term "AT&T", and shows the user the websites
that have content in which the term occurs (402). Some
excerpted content (e.g., 403) is also displayed. Because the
search is not narrowed by category, it is difficult
for the user to efficiently and accurately identify websites
that have content of interest to the user.

Just as the Yahoo!-type search can be too narrow, the
AltaVista™-type content search can be too broad. For
example, the results for the keyword search shown in FIG.
4 include over 300,000 websites 404. Even when the results
are organized in some prioritized fashion (e.g., websites with
the greatest number of occurrences of the keyword term are
listed first), such a broad result is too large to be very useful
to the user.

Searching by category and then using a keyword search to
search the descriptive information about websites within a
category can be too narrow, and miss detecting websites that
have content that is relevant to the user's request. On the
other hand, keyword searching of only the content of websites
can be too broad. A way is needed to take advantage of
the narrowing effect of a category search and the depth of a
content search to yield a more accurate and complete search
result.

SUMMARY OF THE INVENTION 

In accordance with an embodiment of the present
invention, websites are searched for desired information first
by narrowing the scope of the search by identifying websites
that correspond with a category pertinent to the desired
information. Next, a keyword search is carried out on the
content (not just the descriptions or summaries of content) of
websites that fall within the pertinent category. This is
advantageously more efficient than searching all of the
content of the universe of websites initially, because such a
search often disadvantageously returns too many results,
many of which can be irrelevant (e.g., as in AltaVista™).
Likewise, it provides higher resolution than simply performing
a category search, which can fail to identify websites
within the category that have the most relevant information.
It also provides higher resolution than narrowing the field of
websites by category, and then performing a keyword search
on website descriptions or content summaries, e.g., as in
Yahoo!, which can miss relevant information that is included
in the content itself, but not in the description or summary.
The present invention advantageously combines the efficiency
and accuracy of category and content searching to
provide a more efficient, better way of finding the information
most relevant to a user's need in a set of websites.

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 shows an interface to a prior art embodiment of a 
category/descriptive information search engine.

FIG. 2 shows a keyword search request for websites that 
fall within a subcategory of the prior art search engine 
shown in FIG. 1. 

FIG. 3 shows the results of the keyword search request 
submitted as shown in FIG. 2 in the prior art search engine
shown in FIG. 1. 

FIG. 4 shows an interface and a keyword search request 
to a prior art embodiment of a content search engine. 

FIG. 5 shows a system in accordance with an embodiment 
of the present invention.

FIG. 6 is a flow chart illustrating an embodiment of the 
method in accordance with an embodiment of the present 
invention. 

FIG. 7 shows an interface in accordance with an embodiment
of the present invention.

FIG. 8 shows an interface that displays categories for user 
selection in accordance with an embodiment of the present 
invention. 

FIG. 9 shows an interface that displays subcategories of
the categories shown in the interface depicted in FIG. 8 for 
user selection in accordance with an embodiment of the 
present invention. 

FIG. 10 shows the results of a content search after 
category selection in accordance with an embodiment of the
present invention. 

DETAILED DESCRIPTION 

The present invention provides a system and a method
that advantageously combine the best aspects of category
searching and content searching of websites in a way that
enables a user to more accurately and completely identify
websites with content of interest to the user, especially in a
large collection of websites.

A system in accordance with an embodiment of the
present invention is shown in FIG. 5. A search computer 501
is connected to a network 502 to which users 503 and sites
504 are also connected. The search computer 501 includes a
processor 505, a memory 506 and a port 507. The memory
506 and the port 507 are coupled to the processor 505. The
memory 506 stores website content correlated with categories
508. The memory 506 further stores category-content
search instructions 509 adapted to be executed by the
processor 505 to retrieve content from websites over a
network and cause the retrieved content to be stored, to
correlate a piece of content with a category, to receive a
category selection from a user, to receive a keyword search
from the user, and then to perform a content search on that
stored website content which is correlated with the selected
category. The term "correlated with the selected category"
encompasses subcategories in embodiments having a hierarchical
categorization scheme. The category-content
instructions 509 are further adapted to be executed by the
processor 505 to send the results of a search to the user.

In one embodiment of the present invention, website 
content is automatically gathered and stored using a soft- 
ware application called a spider, such as the Vspider, manu- 
factured by Verity, Inc. of Sunnyvale, Calif. A spider is a 
computer program that automatically seeks out information 
(i.e., content) distributed on various nodes of a network 
(e.g., at websites on the Internet, or on an intranet) and sends 
it back to a predetermined location (e.g., the spider's "home 
server") such as a search computer shown as 501 in FIG. 5. 
A spider such as Vspider can advantageously be used to 
collect the content to be searched in accordance with the 
present invention. 

In one embodiment, the content that is retrieved by a 
spider is stored in a database. The database is coupled to a 
search computer, such as search computer 501 shown in 
FIG. 5. The content is searchable in the database using a 
known database search language, such as SQL. 
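
As an illustration of the database embodiment, the following minimal Python sketch stores retrieved pages in an SQLite database and uses SQL to restrict a keyword search to one category. The table and column names (pages, url, category, content), the sample rows, and the choice of SQLite with a LIKE match are assumptions made for illustration only; the specification names SQL only as an example of a search language.

```python
import sqlite3

def build_database(rows):
    """Create an in-memory database of (url, category, content) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (url TEXT, category TEXT, content TEXT)")
    conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", rows)
    return conn

def category_content_search(conn, category, keyword):
    """Return URLs of pages in `category` whose content contains `keyword`."""
    cursor = conn.execute(
        "SELECT url FROM pages WHERE category = ? AND content LIKE ? ORDER BY url",
        (category, "%" + keyword + "%"),
    )
    return [row[0] for row in cursor]

# Hypothetical sample data standing in for content gathered by a spider.
SAMPLE_ROWS = [
    ("http://example.com/novel", "Literature", "The telephone rang twice."),
    ("http://example.com/poem", "Literature", "A quiet garden at dusk."),
    ("http://example.com/phones", "Products", "Buy a telephone today."),
]

conn = build_database(SAMPLE_ROWS)
print(category_content_search(conn, "Literature", "telephone"))
```

The category test and the content test sit in the same WHERE clause, so content outside the selected category is never matched. The sketch does not escape the LIKE wildcard characters (% and _) in the keyword, which a fuller implementation would need to handle.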

In one embodiment, the Vspider is given the Uniform 
Resource Locator (URL) of a website. Vspider then searches 
the file corresponding to the URL, and identifies links from 
that file to other pages (the terms file and page are equivalent 
as used herein), which it proceeds to search. Upon searching 
a page, Vspider returns information such as the identity of 
the author of the page, the date on which the page was 
created, its size and some analysis of its textual content, 
possibly including at least a part of the textual content itself. 
An embodiment of the present invention advantageously 
uses the Verity spider in this fashion to automatically and 
efficiently gather website content, as well as information 
about the website. 
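
The link-following behavior described above can be sketched in Python as follows. This is not the Vspider product or its interface, which the specification does not detail; it is a generic breadth-first traversal written with the standard library. The fetch callable, the FAKE_SITE dictionary that stands in for a network, and the fields recorded per page are all hypothetical.

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch):
    """Visit start_url and every page reachable from it by links.

    `fetch` is a hypothetical callable that maps a URL to page text,
    or returns None when the page is unavailable.
    """
    visited = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue  # already gathered; avoids looping on cyclic links
        text = fetch(url)
        if text is None:
            continue
        parser = LinkExtractor()
        parser.feed(text)
        # Record some information about the page along with its content.
        visited[url] = {"size": len(text), "content": text}
        queue.extend(parser.links)
    return visited

# Stand-in for a network: a fixed set of pages keyed by URL.
FAKE_SITE = {
    "http://example.com/": '<a href="http://example.com/a">A</a>',
    "http://example.com/a": '<a href="http://example.com/">home</a> telephone',
}

pages = crawl("http://example.com/", FAKE_SITE.get)
print(sorted(pages))
```

The sketch only follows absolute URLs exactly as written in each page; resolving relative links and limiting the traversal to one website are left out for brevity.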

In one embodiment, the processor 505 is a 
microprocessor, such as the Pentium II processor manufac- 
tured by the Intel Corporation of Santa Clara, Calif. In 
another embodiment, the processor 505 is an Application 
Specific Integrated Circuit (ASIC) which at least partly 
embodies the category-content instructions 509, the rest of
which (if any) are stored in the memory 506. 

Embodiments of memory 506 include read-only memory 
(ROM), random access memory (RAM), a hard disk, a 
compact disc, a database, or any other device adapted to 
store information in digital form, or any combination 
thereof. 

The term "adapted to be executed by the processor" is 
meant to encompass instructions that are compressed, 
encrypted, uncompiled, or must otherwise be processed in 
order to be executed by the processor 505. Machine language
or any other format of instruction that can be executed
by the processor 505 without further manipulation is also
meant to be encompassed by this term.

A method in accordance with an embodiment of the 
present invention is now described with reference to the flow 
chart shown in FIG. 6. Website content is retrieved through
a network (step 301), and is stored (step 302). A piece of
stored website content is correlated with a category (step
303). A category selection is received from a user (step 304).
A content search request (e.g., a keyword search request) for
websites in the selected category is received from the user
(step 305). A content search on the stored website content
that is correlated with the selected category is then performed
(step 306). The results of this category-content
search are sent to the user (step 307).
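
The steps above can be sketched as a small in-memory Python class. This is a minimal illustration under stated assumptions, not the claimed implementation: the class name CategoryContentSearch, its method names, and the sample URLs and text are hypothetical, and a case-insensitive substring match stands in for whatever keyword matching an embodiment would use.

```python
class CategoryContentSearch:
    """In-memory sketch of steps 301 through 307."""

    def __init__(self):
        self.content = {}     # URL -> stored text (steps 301 and 302)
        self.categories = {}  # URL -> category (step 303)

    def store(self, url, text):
        """Store content that has been retrieved from a website."""
        self.content[url] = text

    def correlate(self, url, category):
        """Correlate a piece of stored content with a category."""
        self.categories[url] = category

    def search(self, category, keyword):
        """Steps 304 to 307: search only content in the selected category."""
        needle = keyword.lower()
        return sorted(
            url
            for url, text in self.content.items()
            if self.categories.get(url) == category and needle in text.lower()
        )

engine = CategoryContentSearch()
engine.store("http://example.com/novel", "She waited by the telephone.")
engine.correlate("http://example.com/novel", "Literature")
engine.store("http://example.com/shop", "Telephone accessories for sale.")
engine.correlate("http://example.com/shop", "Products and Services")

print(engine.search("Literature", "telephone"))
```

Both sample pages contain the keyword, but only the page correlated with the selected category is returned, which is the narrowing effect the method relies on.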

FIG. 7 shows an interface for an embodiment of the 
present invention through which a user selects a category. 
Categories 701 are listed under the heading "Search by
Subject." For example, a user selects the "Products and 
Services" category 702, which causes the interface shown in 
FIG. 8 to be displayed. The user then selects the subcategory 
"AT&T WorldNet™ Services" 801 (shown in FIG. 8), which 
causes the interface shown in FIG. 9 to be displayed. As 
shown in FIG. 9, the user then submits a search for the 
keyword "telephone" 901. A content search for files in which 
the term "telephone" occurs is performed on content (e.g., 
files) stored from websites that fall into the category "AT&T
WorldNet™ Services." The results of the search are dis- 
played to the user in one embodiment as a dynamically 
generated web page, such as the one shown in FIG. 10. The 
term "dynamically generated web page" means a web page 
that includes content specifically tailored to respond to the 
user query. 

In one embodiment of the present invention, a dynamic 
index is stored that includes a list of identifiers (e.g., URLs) 
for websites that are associated with a selected category. The 
dynamic index is used to track the identities of all websites 
that correspond to a selected category or categories. For 
example, in a hierarchical category system wherein a cat- 
egory includes certain other categories (e.g., the literature 
category includes the classics and modern romance 
categories), a dynamic index includes identifiers for all 
websites in the selected category and its subcategories. 
When a user further narrows a category selection, the 
identifiers of newly excluded websites are dropped from the 
dynamic index. Likewise, when a user broadens a category 
selection, the identifiers of newly included websites are 
added to the dynamic index. 
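
One possible reading of the dynamic index is sketched below in Python. The category hierarchy, the website identifiers, and the names sites_under and DynamicIndex are hypothetical, and the sketch assumes the hierarchy contains no cycles.

```python
# Hypothetical hierarchy: parent category -> list of subcategories.
SUBCATEGORIES = {
    "Literature": ["Classics", "Modern Romance"],
    "Classics": [],
    "Modern Romance": [],
}

# Hypothetical assignment of website identifiers (URLs) to categories.
SITES_BY_CATEGORY = {
    "Literature": ["http://example.com/lit"],
    "Classics": ["http://example.com/homer"],
    "Modern Romance": ["http://example.com/romance"],
}

def sites_under(category):
    """Identifiers for a category and all of its subcategories."""
    found = set(SITES_BY_CATEGORY.get(category, []))
    for child in SUBCATEGORIES.get(category, []):
        found |= sites_under(child)
    return found

class DynamicIndex:
    """Tracks the website identifiers matching the current selection."""

    def __init__(self):
        self.identifiers = set()

    def select(self, category):
        """Update the index; return the (added, dropped) identifiers."""
        target = sites_under(category)
        added = target - self.identifiers
        dropped = self.identifiers - target
        self.identifiers = target
        return added, dropped

index = DynamicIndex()
index.select("Literature")                 # broad selection: three sites
added, dropped = index.select("Classics")  # narrowing drops two sites
print(sorted(index.identifiers), sorted(dropped))
```

Returning the added and dropped identifiers makes the narrowing and broadening behavior visible; a content search would then be run only over the identifiers currently held in the index.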

A content search in one embodiment searches all of the 
content of all of the pages that comprise a website that falls 
within the selected category or categories. In another 
embodiment, the content search is performed by searching a 
subset of the content stored at the website in the selected 
category. For example, the content search can be restricted 
to the contents of metatags in the pages of the website. A 
metatag is defined herein as a subset of content marked off
from other content in a page. For example, the following line 
of text is embedded in a page at a website: 

This is the content that will not be searched <METATAG> 
and this is the content that will be searched </METATAG> 
That is, the content between <METATAG> and 
</METATAG> will be searched, while the rest will not be 
searched. 
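
A minimal Python sketch of this restriction follows, using the example line given above. The function names and the use of a regular expression are assumptions for illustration; <METATAG> is the illustrative marker from the text, not a standard HTML element.

```python
import re

# Matches text between the illustrative <METATAG> ... </METATAG> markers.
METATAG_PATTERN = re.compile(r"<METATAG>(.*?)</METATAG>", re.DOTALL)

def searchable_text(page):
    """Return only the marked-off portions of a page, joined by spaces."""
    return " ".join(part.strip() for part in METATAG_PATTERN.findall(page))

def metatag_search(page, keyword):
    """True if `keyword` occurs inside a marked-off region of `page`."""
    return keyword.lower() in searchable_text(page).lower()

PAGE = (
    "This is the content that will not be searched <METATAG> "
    "and this is the content that will be searched </METATAG>"
)

print(searchable_text(PAGE))
print(metatag_search(PAGE, "will be"), metatag_search(PAGE, "will not"))
```

The phrase "will not" appears only outside the marked-off region, so it is not found, while "will be" appears inside the region and is found.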

In the results shown in FIG. 10, files that contain the term
"telephone" are shown (1001) ranked in order, where a file with
more occurrences of the keyword is shown before a file with
fewer occurrences. The
name of the file (or site) 1002 is displayed, along with an
excerpt of content (1003) from the file or site. A hyperlink
(1004) to the site or file is also provided, as well as an
indication of the file's size (1005). The number of the results
(1006) returned for a search in accordance with the present
invention is typically substantially smaller (and therefore
more manageable) than the number of results returned for an
identical search request submitted to AltaVista™. Also, the
present invention advantageously provides more comprehensive
and accurate results than a comparable Yahoo!
search in many cases. The advantageous combination of
category and content searching provided in accordance with
the present invention produces website search results that
are more accurate and comprehensive than the results provided
by known systems.
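
The ranking by number of keyword occurrences described above can be sketched in Python as follows. The file names and text are hypothetical, and the simple substring count is an assumption; it would also count the keyword inside longer words such as "telephones".

```python
def rank_by_occurrences(files, keyword):
    """Return (name, count) pairs for files containing `keyword`,
    ordered so that files with more occurrences come first."""
    needle = keyword.lower()
    counted = [(name, text.lower().count(needle)) for name, text in files.items()]
    hits = [(name, count) for name, count in counted if count > 0]
    # Sort by descending count; break ties by name for a stable display order.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical stored files from websites in the selected category.
FILES = {
    "plans.html": "Telephone plans. Compare telephone rates by telephone.",
    "help.html": "Contact us by telephone.",
    "about.html": "Company history.",
}

print(rank_by_occurrences(FILES, "telephone"))
```

Files with no occurrence of the keyword are left out of the results entirely, so the length of the returned list corresponds to the number of results reported to the user.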

Although several embodiments are specifically illustrated 
and described herein, it will be appreciated that modifications
and variations of the present invention are covered by
the above teachings and within the purview of the appended 
claims without departing from the spirit and intended scope 
of the invention. 
What is claimed is: 

1. A method for searching for information stored at 
websites, comprising the steps of: 

a. retrieving website content through a network; 

b. correlating a piece of retrieved website content with a 
category; 

c. receiving a category selection; 

d. receiving a content search request for content in the 
selected category; and 

e. performing a content search on retrieved website con- 
tent that is correlated with the selected category. 

2. The method of claim 1, further comprising the steps of: 

f. receiving description information for a website from a 
registrant; and 

g. correlating the website with a category based upon the 
description information. 

3. The method of claim 1, further comprising the step of 
presenting a first category name to a user as a hyperlink to 
a second category name, the second category being a sub- 
category of the first category. 

4. The method of claim 1, wherein the step of performing 
the content search includes the steps of: 

a. maintaining a dynamic index that includes a list of 
identifiers for websites that are associated with the 
selected category; 

b. searching a representation of the content of each 
website whose identifier occurs in the dynamic index; 
and 

c. sending the results of the search to the user.

5. The method of claim 1, wherein performing a content 
search includes the step of performing a keyword search. 

6. The method of claim 1, wherein the content search 
includes performing a keyword search on the contents of 
metatags stored in pages at the website. 

7. The method of claim 1, wherein the step of performing 
the content search includes the steps of: 

a. maintaining a web page index that includes a list of 
identifiers for web pages that comprise a website; 

b. receiving a website selection from a user; 

c. receiving a web page content search request from the 
user; 

d. searching the content of the web pages that comprise 
the selected website based upon the web page content 
search request from the user; and 

e. sending the results of the web page content search to the 
user. 

8. The method of claim 1, wherein a category selection is 
a Uniform Resource Locator. 

9. The method of claim 7, wherein the step of sending the 
results of the web page content search to the user includes 
the step of sending the Uniform Resource Locator of a web 
page in which information responsive to the user web page 
content request is stored. 

10. The method of claim 1, wherein the step of performing 
the content search includes the steps of: 

a. maintaining a web page index that includes a list of 
identifiers for web pages that comprise a website; 

b. receiving a website selection from a user; 

c. receiving a web page Uniform Resource Locator search 
request from the user; 
d. searching the Uniform Resource Locators of the web
pages that comprise the selected website based upon 
the web page Uniform Resource Locator search request 
from the user; and 

e. sending the results of the web page Uniform Resource
Locator search to the user. 

11. The method of claim 1, further comprising the step of 
sending the results of the content search to a user. 

12. The method of claim 11, wherein the results sent to the
user are adapted to be displayed ranked in the order of their
relevance such that a more relevant result is displayed before
a less relevant result.

13. The method of claim 11, wherein the results of the 
content search are sent to the user as a dynamically generated
web page.

14. The method of claim 11, wherein the results sent to the 
user include a website identifier and information pertaining 
to the content of the website corresponding to the identifier. 

15. An apparatus for searching for information stored at 
websites, comprising: 

a. a processor; 

b. a memory that stores category-content search instruc- 
tions adapted to be executed by said processor to 
retrieve content from websites, store the retrieved
content, correlate a piece of stored content to a 
category, receive a category selection, receive a content 
search request, perform a content search of stored 
website content that is correlated with the selected 
category, and to send the results of the content search
to a user, said memory coupled to said processor; and 

c. a port adapted to be coupled to a network, said port 
coupled to said processor and said memory. 

16. The apparatus of claim 15, wherein said category- 
content search instructions are further adapted to be 
executed by said processor to receive description information
for a website from a registrant, and to associate the
website with a category based upon the website description 
information. 

17. The apparatus of claim 15, wherein said category- 
content search instructions are adapted to be executed by 
said processor to maintain a dynamic index that includes a 
list of identifiers for websites that are associated with the 
selected category, to search a representation of the content of 
each website whose identifier occurs in the dynamic index, 
and to send the results of the search to a user. 

18. The apparatus of claim 15, wherein said category- 
content search instructions are further adapted to be 
executed by said processor to dynamically generate a web 
page that reflects the results of the category-content search
and that is adapted to be displayed to a user. 

19. The apparatus of claim 15, wherein said memory 
includes a database. 

20. The apparatus of claim 19, wherein said content- 
category search instructions are adapted to be executed by 
said processor to search, read from and write to said data- 
base. 

21. A program storage device readable by a machine, 
tangibly embodying a program of instructions executable by 
the machine to perform the method steps for searching for 
information stored at websites, the method steps comprising: 

a. retrieving website content through a network; 

b. correlating a piece of stored website content with a 
category; 

c. receiving a category selection; 

d. receiving a content search request for content in the 
selected category; and 

e. performing a content search on retrieved website con- 
tent that is correlated with the selected category. 






